Introduction

  • The presidential election of 1936 pitted Alfred Landon, the Republican governor of Kansas, against the incumbent President, Franklin D. Roosevelt

  • Literary Digest sent out 10 million "straw" ballots; they received back 2.4 million

  • Straw polls, in contrast to opinion polls (which are usually conducted by telephone and based on samples of the voting public), make no attempt at scientific sampling

  • They predicted Alf Landon would win 57% to 43%

  • As it turned out, Roosevelt won 62% to 37%!

Two morals of the story:



  • A badly chosen big sample is much worse than a well-chosen small sample!



  • Watch out for selection bias and non-response bias!

What is a survey?

A survey is the process of collecting data from a sample of a defined population for the purpose of estimating attributes about that population.

Why not conduct a census?

Surveys are used instead of collecting data from the entire population (that is, conducting a census) because a survey

  • is less costly

  • can be completed in less time

  • can be more accurate

  • can use specialized data collection methods

  • is the only choice when it is not possible to measure the entire population.






Target vs. sampling populations




Multiple choice poll

In most cases, the target population and the sampling population are not exactly the same because of which of the following reasons?

  1. The sampling population might contain units that are not in the target population

  2. The target population might contain units that are not in the sampling population

  3. Both (1) and (2)

  4. Neither (1) nor (2)


Sampling designs

  • There are many types of sampling designs for obtaining data that are scientifically valid

  • We can also obtain standard errors of our estimates

  • Efficient sampling designs can save a lot of time, effort, and money

  • Collecting data takes time and can be costly!

  • Efficient sampling designs allow us to obtain the same level of information with smaller sample sizes

  • A valid sampling plan is needed in order to obtain useful data for analysis

Example

Suppose we wanted to estimate the number of trees in a forest with a particular disease

  • If the forest is large, then it may be impractical to examine every tree in the forest

  • One approach is to divide the forest into plots of a particular size (say 1 acre) and then obtain a random sample of these plots

  • Next, count the number of diseased trees in each sampled plot and from these, obtain an unbiased estimate of the true total

Sampling designs

  • An estimator is unbiased if it does not systematically under- or over-estimate the true population attribute (e.g., \(\mu\))

  • What does this mean??

  • We can use confidence intervals to assess the reliability of our estimates

  • Confidence intervals can be thought of as giving a range of plausible values for the quantity being estimated

  • If one obtains a very small sample size, the resulting confidence interval will likely be too wide and not very informative

  • All else held constant, as sample sizes get larger, confidence intervals get narrower leading to a more precise estimate

Common (and useful) sampling designs

  • Simple Random Sampling (SRS)

  • Stratified Random Sampling

  • Systematic Sampling

  • Two-stage sampling

  • Cluster sampling

Terminology and notation

  • Census: This occurs when one samples the entire population of interest

    • The United States government tries to do this every 10 years. However, in practical problems, a true census is almost never possible

    • In most practical problems, instead of obtaining a census, a sample is obtained by observing the population of interest

    • One must of course determine the population of interest – this is not always an easy problem

  • Element: an object on which a measurement is taken

  • Sampling Units: non-overlapping (usually) collections of elements from the population. In some situations, it is easy to determine the sampling units (e.g., households, hospitals, plots, etc.) and in others there may not be well-defined sampling units (e.g., acre plots in a forest)

  • Frame: A list of the sampling units

  • Sample: A collection of sampling units from the frame

Terminology and notation

  • \(N\)—number of units in the population.

  • \(n\)—sample size (number of units sampled).

  • \(y\)—variable of interest.

  • \(y_i\)—the \(i\)-th observation in a sample or population

Two types of errors

  • Sampling errors: result from the fact that we generally do not sample the entire population. For example, the sample mean will not equal the population mean. This statistical error is fine and expected. Statistical theory can be used to ascertain the degree of this error by way of standard error estimates.

  • Non-sampling errors: a catchall phrase for all other errors, such as non-response and clerical errors. Sampling errors cannot be avoided (unless a census is taken). However, every effort should be made to avoid non-sampling errors, for example, by properly training those who do the sampling and by carefully entering the data into a database.

Simple random sampling (SRS)

An SRS is the design where each subset of \(n\) units selected from the population of size \(N\) has the same chance (i.e., probability) of being selected.

Note: Under SRS, each unit in the frame has the same chance of being selected in the sample. However, the converse is not true. That is, there are sampling plans where each unit has the same probability of selection into the sample, but it is not an SRS.

Question

Suppose the frame for the population consists of sampling units labeled A, B, C, and D (i.e., \(N = 4\)) and we wish to obtain a sample of size \(n = 2\).

  1. How many possible random samples of size two are there?

  2. What is the probability of obtaining any one of these samples?
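One way to check these answers in R (spoilers below!) is with `choose()`, which counts the possible samples, and `combn()`, which enumerates them:

```r
# Number of possible SRSs of size n = 2 from a frame of N = 4 units
choose(4, 2)
## [1] 6

# List all six samples explicitly
combn(c("A", "B", "C", "D"), 2)

# Under SRS, each of these samples is equally likely
1 / choose(4, 2)
## [1] 0.1666667
```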

How do we obtain an SRS of size \(n\)?

  1. Label all the sampling units in the population as \(1, 2, \dots, N\)

  2. Pick \(n\) numbers at random from this list (without replacement)

  • This is akin to putting the numbers 1 through \(N\) on slips of paper, placing them in a hat, and then randomly picking \(n\) slips from the hat

  • Of course, this would be tedious to do in practice—especially if \(N\) is large!

  • Instead, software can be used to generate (pseudo) random samples

  • Many books make use of random number tables, but these are rather archaic!

Obtaining an SRS in R

In R, one can use the function sample to obtain an SRS. For example, suppose the frame has \(N = 100\) units and you want to select an SRS of size \(n = 5\) units. This can be done quite easily in R:

sample(1:100, size = 5, replace = FALSE)
## [1] 44 82  9 85 64
sample(100, size = 5, replace = FALSE)  # equivalent
## [1] 97 22 81 87 23
  • The set.seed function can be used to specify seeds for random number generation!

SRS: estimating the population mean

  • The population mean: \(\mu = \frac{1}{N} \sum_{i = 1} ^ N y_i\)

  • The population variance: \(\sigma ^ 2 = \frac{1}{N} \sum_{i = 1} ^ N \left(y_i - \mu\right) ^ 2\)

  • Estimating the population mean: \(\widehat{\mu} = \bar{y} = \frac{1}{n} \sum_{i = 1} ^ n y_i\)

  • Estimating the population variance: \(\widehat{\sigma} ^ 2 = s ^ 2 = \frac{1}{n - 1} \sum_{i = 1} ^ n \left(y_i - \bar{y}\right) ^ 2\)

SRS: estimating the population mean

  • The sample mean \(\bar{y}\) is an unbiased estimator for the population mean \(\mu\)

  • The sample variance \(s ^ 2\) is an unbiased estimator for the population variance \(\sigma ^ 2\)

  • The variance of the sample mean is given by \(var\left(\bar{y}\right) = \frac{\sigma ^ 2}{n}\left(1 - \frac{n}{N}\right)\)

  • Since \(\sigma\) is not generally known in practice, we estimate the variance of the sample mean with \(\widehat{var}\left(\bar{y}\right) = \frac{s ^ 2}{n}\left(1 - \frac{n}{N}\right)\)

  • The factor \(1 - \frac{n}{N} = \frac{N - n}{N}\) is called the finite population correction factor (FPCF)

  • The square root of the variance of \(\bar{y}\) is called the standard error of \(\bar{y}\): \(SE\left(\bar{y}\right) = \frac{s}{\sqrt{n}}\sqrt{1 - \frac{n}{N}}\)

  • In fact, the standard deviation of any statistic is called a standard error
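As a quick illustration of these formulas, here is a minimal sketch in base R (the five observations and the population size are made up):

```r
# Hypothetical SRS of n = 5 observations from a population of N = 100 units
y <- c(12, 15, 9, 14, 10)
N <- 100
n <- length(y)

ybar <- mean(y)                 # estimate of the population mean
s2 <- var(y)                    # unbiased estimate of the population variance
fpcf <- 1 - n / N               # finite population correction factor
se_ybar <- sqrt(s2 / n * fpcf)  # standard error of the sample mean
c(ybar, se_ybar)
```

Here the FPCF shrinks the standard error only slightly because \(n / N = 0.05\) is small; as \(n\) approaches \(N\), the FPCF drives the standard error toward zero.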

Example

  • Consider two populations of sizes \(N_1 = 1000000\) and \(N_2 = 1000\)

  • Suppose the variance of a variable \(y\) is the same for both populations

  • Which will give a more accurate estimate of the mean of the population: an SRS of size \(1000\) from the first population or an SRS of size 30 from the second population?

  • In the first case, \(1000\) out of a million is \(1 / 1000\)-th of the population. In the second case, \(30 / 1000\) is 3% of the population

  • Surprisingly, the sample from the larger population will be more accurate!
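A quick computation of \(var\left(\bar{y}\right)\) for both scenarios confirms the claim (we take \(\sigma ^ 2 = 1\); its value is arbitrary since the variances are assumed equal):

```r
sigma2 <- 1  # common variance; its actual value does not affect the comparison

# SRS of n = 1000 from N = 1,000,000 (0.1% of the population)
v1 <- sigma2 / 1000 * (1 - 1000 / 1e6)

# SRS of n = 30 from N = 1000 (3% of the population)
v2 <- sigma2 / 30 * (1 - 30 / 1000)

c(v1, v2)  # v1 is much smaller: absolute sample size matters far more
```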

Confidence intervals

A \((1 - \alpha)100\)% confidence interval for the population mean \(\mu\) can be formed using the following formula: \[ \bar{y} \pm t_{1 - \alpha / 2, n - 1} \times \frac{s}{\sqrt{n}}\sqrt{1 - \frac{n}{N}} \] where \(t_{1 - \alpha / 2, n - 1}\) is the \(1 - \alpha / 2\) percentile from a \(t\)-distribution with \(n - 1\) degrees of freedom.

  • In R, we can obtain the value of \(t_{1 - \alpha / 2, n - 1}\) using the qt function. For example, if \(\alpha = 0.1\) and \(n = 100\), then
qt(0.95, df = 99)
## [1] 1.660391

Estimating a population total

  • Oftentimes, we are interested in estimating a population total, say \(\tau\)

  • For instance, we may be interested in knowing how many trees in a forest have a particular disease

  • If the sampling unit is a square acre and the forest has \(N = 1000\) acres, then \(\tau = N \mu = 1000 \mu\). Since \(\mu\) is estimated by the sample mean \(\bar{y}\), we can estimate the population total using \(\widehat{\tau} = N \bar{y}\)

  • The variance of \(\widehat{\tau}\) is given by \[var\left(\widehat{\tau}\right) = var\left(N \bar{y}\right) = N ^ 2 var\left(\bar{y}\right) = N ^ 2 \left(1 - n / N\right) \sigma ^ 2 / n\]

  • What's the estimated standard error of \(\widehat{\tau}\)?

  • A \(\left(1 - \alpha\right) \times 100\)% CI for \(\tau\) is given by \[\widehat{\tau} \pm t_{1 - \alpha / 2, n - 1} \times \left(s / \sqrt{n}\right) \sqrt{N\left(N - n\right)}\]

Confidence intervals and sample size

  • How big of a sample should one collect?

  • If \(n\) is too small, the standard errors of our estimates may be too large, making the estimates we obtain essentially useless

  • If \(n\) is too large, we may be collecting too many observations, which is wasteful!

  • Statistical reasoning can be used to obtain an idea of how large \(n\) should be

  • Confidence intervals can be used to help determine an appropriate sample size

  • When estimating a population total, we would select the smallest \(n\) such that \[n \ge \frac{N \sigma ^ 2 z_{1 - \alpha / 2} ^ 2}{\sigma ^ 2 z_{1 - \alpha / 2} ^ 2 + N d ^ 2},\] where \(d\) is the proposed half width of the CI (specified by the researcher)
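A minimal sketch of this sample size computation (the helper name `sample_size` and the inputs below are illustrative; in practice \(\sigma ^ 2\) must be guessed, e.g., from a pilot study):

```r
# Smallest n satisfying the sample size inequality above
sample_size <- function(N, sigma2, d, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)  # z_{1 - alpha/2}
  ceiling(N * sigma2 * z^2 / (sigma2 * z^2 + N * d^2))
}

# Hypothetical: N = 3000 plots, guessed variance 10, desired half width 0.5
sample_size(N = 3000, sigma2 = 10, d = 0.5)
## [1] 147
```

Note that a smaller desired half width \(d\) forces a larger sample size, as one would expect.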

Example

  • Suppose a study is done to estimate the number of ash trees in a state forest consisting of \(N = 3000\) acres.

  • A sample of \(n = 100\) one-acre plots are selected at random and the number of ash trees per selected acre are counted.

  • Suppose the average number of trees per acre was found to be \(\bar{y} = 5.6\) with a standard deviation of \(s = 3.2\).
  • Find a 95% confidence interval for the total number of ash trees in the state forest!

  • The estimated total number of ash trees in the forest is \(\widehat{\tau} = N \bar{y} = 3000 \times 5.6 = 16800\).
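Carrying the example through in R with the CI formula for \(\tau\) given earlier:

```r
# Ash tree example: N = 3000 acres, n = 100 sampled plots
N <- 3000; n <- 100; ybar <- 5.6; s <- 3.2

tau_hat <- N * ybar                          # estimated total: 16800 trees
se_tau <- (s / sqrt(n)) * sqrt(N * (N - n))  # estimated standard error
tcrit <- qt(0.975, df = n - 1)
tau_hat + c(-1, 1) * tcrit * se_tau          # 95% CI: roughly 16800 +/- 1873
```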

A word of caution on sample size

Run the student.R file

Estimating a population proportion

  • Consider a situation where for each sampling unit we record a zero or a one, indicating whether or not the sampling unit is of a particular type.

  • Such a variable is referred to as a binary variable

  • A very common instance of this type of sampling is with opinion polls (e.g., do you or do you not support candidate X?)

  • Suppose you take a survey of plants and you note whether or not each plant has a particular disease.

  • Interest in such a case focuses on estimating the proportion, denoted \(p\), of plants that have the disease.

  • If we obtain a sample of size \(n\) from a population of size \(N\), and each unit in the population either has or does not have a particular attribute of interest (e.g., disease or no disease), then the number of items in the sample that have the attribute is a random variable having a hypergeometric distribution

Estimating a population proportion

  • If \(N\) is considerably larger than \(n\), then the hypergeometric distribution can be approximated by the binomial distribution

  • Consider a sample of size \(n\) from such an experiment: \(y_1, y_2, \dots, y_n\), where \(y_i = 1\) if the \(i\)-th observation has the attribute, and \(y_i = 0\) otherwise

  • The population proportion \(p\) is given by \(p = \frac{1}{N} \sum_{i = 1} ^ N y_i\)

  • We can estimate \(p\) using the sample proportion \(\widehat{p} = \frac{1}{n} \sum_{i = 1} ^ n y_i\)

  • The estimated variance of \(\widehat{p}\) can also be easily derived: \(\widehat{var}\left(\widehat{p}\right) = FPCF \times \frac{\widehat{p}\left(1 - \widehat{p}\right)}{n - 1}\)

  • An (approximate) \(\left(1 - \alpha\right) \times 100\)% confidence interval for \(p\) is given by \(\widehat{p} \pm z_{1 - \alpha / 2} \sqrt{\widehat{var}\left(\widehat{p}\right)}\)

  • Using this formula, sample sizes can also be determined to achieve a particular half width \(d\) (see the formula in the book!)
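These formulas are straightforward to compute; the survey below is hypothetical (a sample of 200 plants from a population of 5000, with 46 diseased):

```r
# Hypothetical plant disease survey
N <- 5000; n <- 200; y <- 46

p_hat <- y / n                                 # sample proportion: 0.23
fpcf <- 1 - n / N                              # finite population correction
var_p <- fpcf * p_hat * (1 - p_hat) / (n - 1)  # estimated variance of p_hat
p_hat + c(-1, 1) * qnorm(0.975) * sqrt(var_p)  # approximate 95% CI
```

For these numbers the interval is roughly (0.17, 0.29).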

Stratified random sampling

  • Data is often expensive and time consuming to collect

  • Statistical ideas can be used to determine efficient sampling plans that will provide the same level of accuracy for estimating parameters with smaller sample sizes

  • SRS works just fine, but we can often do better in terms of efficiency!

  • A popular alternative to SRS is stratified random sampling

  • Stratified random sampling can lead to dramatic improvements in precision over simple random sampling

  • The idea is to partition the population into \(K\) different strata

  • Ideally, the units within a stratum will be more homogeneous

  • For stratified random sampling, one simply applies SRS to each stratum

  • The values from each stratum can be combined to form estimates for the overall population

Stratified random sampling

  • Several decisions need to be made in order to implement stratified random sampling; namely, how to define the strata and how many strata should be used

  • Often, the population of interest will have naturally occurring strata.

    • Example: In a study of water quality, one could form strata corresponding to lakes in the region of interest

    • Example: If a survey is to be conducted in a city, the different neighborhoods or subdivisions could define the strata

  • In other cases, it may not be clear how to define strata, and there could be many different choices for defining them

  • An optimal stratification corresponds to strata defined so that the corresponding estimates have the smallest variance possible

  • Once the strata are defined, a decision has to be made as to how much of the overall sample will be allocated to each!

Stratified random sampling

Stratified random sampling is appropriate when:

  • Separate estimates are needed for each stratum

  • Minimum (positive) sample sizes are required for sub-populations

  • It is necessary to control data collection costs across sub-populations

  • The item of interest is correlated to an auxiliary variable that can be used to stratify the sampling population so as to reduce variance

Three advantages to stratifying:

  1. Parameter estimation can be more precise with stratification (due to smaller within stratum variance).

  2. Sometimes stratifying reduces sampling cost, particularly if the strata are based on geographical considerations.

  3. We can obtain separate estimates of parameters in each of the strata, which may be of interest in and of itself.

Why stratifying helps

  • The total variability of a stratification variable in a stratified design can be partitioned, using the ANOVA decomposition, into SST = SSB + SSW, where

    • SS is the sums of squared deviations

    • SST is the total variance

    • SSB is the variability between strata

    • SSW is the within-stratum variability

  • The sampling variability of an estimator is the sum of the within-strata variability

  • Between-strata variability attributed to SSB does not contribute to sampling variance

  • Hence, the aim of stratification is to partition the total variability so that as much of the total as possible is due to SSB

  • If you construct your strata so that units within a stratum are less variable than in the overall population (low SSW) and strata means are far apart (high SSB), your design will be more efficient than SRS

    • The largest gains are realized when allocation is optimal

Examples where stratifying may be useful

  • Estimate the mean PCB level in a particular species of fish.

    • What are some possible strata?
  • Estimate the proportion of farms in Ohio that use a particular pesticide.

    • What are some possible strata?

# Load required packages
library(ggplot2)  # map_data() also requires the maps package to be installed

# Load data for plotting a map of Ohio with counties
oh_state <- subset(map_data("state"), region == "ohio")
oh_county <- subset(map_data("county"), region == "ohio")

# Plot map of Ohio with counties
ggplot(data = oh_state, mapping = aes(x = long, y = lat, group = group)) + 
  coord_fixed(1.3) + 
  geom_polygon(color = "black", fill = "gray") +
  geom_polygon(data = oh_county, fill = NA, color = "white") +
  geom_polygon(color = "black", fill = NA)

# Sample a few counties
counties <- sort(unique(oh_county$subregion))
length(counties)
## [1] 88
sample(counties, size = 10)
##  [1] "sandusky" "marion"   "jackson"  "athens"   "medina"   "warren"  
##  [7] "lucas"    "geauga"   "highland" "crawford"

Notation and estimators

  • Let \(N_h\) denote the size of the \(h\)-th stratum for \(h = 1, 2, \dots, K\), where \(K\) is the number of strata. Then the overall population size is \[N = \sum_{h = 1} ^ K N_h\]

  • If we obtain an SRS of size \(n_h\) from the \(h\)-th stratum, we can estimate the mean of the \(h\)-th stratum, \(\bar{y}_h\), by simply averaging the data in the \(h\)-th stratum. The estimated variance of \(\bar{y}_h\) is \[var\left(\bar{y}_h\right) = \frac{s_h ^ 2}{n_h} \left(1 - \frac{n_h}{N_h}\right),\] where \(s_h ^ 2\) is the sample variance in the \(h\)-th stratum

Notation and estimators

  • The population mean is given by \[\mu = \sum_{h = 1} ^ K N_h \mu_h / N,\] where \(\mu_h\) is the true mean in the \(h\)-th stratum

  • The population mean can be estimated by \[\bar{y}_s = \sum_{h = 1} ^ K N_h \bar{y}_h / N\]

    • We use the subscript \(s\) to denote an estimator obtained using stratified random sampling
  • Derive the estimated standard error of \(\bar{y}_s\)!

  • Note that \(N_h / N\) represents the proportion of observations in the \(h\)-th stratum
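A sketch of these estimators with made-up strata summaries (the variance formula below combines the per-stratum variances given on the previous slide; deriving it is the exercise above):

```r
# Hypothetical stratified sample with K = 3 strata
N_h <- c(500, 300, 200)       # stratum sizes
n_h <- c(50, 30, 20)          # sample sizes per stratum
ybar_h <- c(10.2, 8.7, 12.4)  # stratum sample means
s2_h <- c(4.1, 3.6, 5.0)      # stratum sample variances
N <- sum(N_h)

ybar_s <- sum(N_h * ybar_h) / N  # stratified estimate of the population mean

# Estimated variance: weighted sum of the within-stratum variances
var_ybar_s <- sum((N_h / N)^2 * (s2_h / n_h) * (1 - n_h / N_h))
se_ybar_s <- sqrt(var_ybar_s)
c(ybar_s, se_ybar_s)
```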

Notation and estimators

  • The population total, \(\tau = N \mu\), can be estimated by \[\widehat{\tau}_s = N \bar{y}_s\]

    • Derive the estimated standard error of \(\widehat{\tau}_s\)



  • Approximate \(1 - \alpha\) confidence interval for \(\mu\): \[\bar{y}_s \pm z_{1 - \alpha / 2} \widehat{SE}\left(\bar{y}_s\right)\]

  • Approximate \(1 - \alpha\) confidence interval for \(\tau\): \[\widehat{\tau}_s \pm z_{1 - \alpha / 2} \widehat{SE}\left(\widehat{\tau}_s\right)\]

Post-stratification

  • Sometimes the stratum to which a unit belongs is unknown until after the data have been collected!

    • For example, values such as age or sex, which could be used to form the strata, may not be known until individual units are sampled
  • The idea of post-stratification is to take an SRS first and then sort the observations into strata afterward (hence the term post-stratification)

  • Once this is done, the data can be treated as if it were a stratified random sample!
  • One difference, however, is that in post-stratification, the sample sizes within each stratum are not fixed ahead of time but are instead random quantities

    • This will cause a slight increase in the variability of the estimated mean (or total)

Allocation in stratified random sampling

  • If a stratified sample of size \(n\) is to be obtained, the question arises as to how to allocate the sample to the different strata

  • In deciding the allocation, three factors need to be considered:

    1. Total number \(N_h\) of elements in each stratum

    2. Variability \(\sigma_h ^ 2\) in each stratum

    3. The cost \(c_h\) of obtaining an observation from the \(h\)-th stratum

  • The optimal allocation of the total sample \(n\) to the \(h\)-th stratum is to choose \(n_h\) such that \[n_h \propto \frac{N_h \sigma_h}{\sqrt{c_h}}\]

Proportional allocation

  • A simple allocation formula is to use proportional allocation where the sample size allocated to each stratum is proportional to the size of the stratum, more specifically, \[n_h = \left(N_h / N\right) \times n,\] where \(n\) is the overall sample size

  • Proportional allocation is often nearly as good as optimal allocation if the costs and variances at the strata are nearly equal

  • Proportional allocation is a simple procedure that does not require knowledge of the within strata variances \(\sigma_h ^ 2\) (which are usually unknown) or sampling cost within each stratum
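A sketch comparing the two allocation rules for hypothetical strata (the sizes, guessed standard deviations, and relative costs below are made up):

```r
# K = 3 hypothetical strata; total sample size n = 100
N_h <- c(500, 300, 200)
sigma_h <- c(2, 6, 4)  # guessed within-stratum standard deviations
c_h <- c(1, 1, 4)      # relative per-observation sampling costs
n <- 100

# Proportional allocation: n_h = (N_h / N) * n
prop <- n * N_h / sum(N_h)

# Optimal allocation: n_h proportional to N_h * sigma_h / sqrt(c_h)
w <- N_h * sigma_h / sqrt(c_h)
opt <- n * w / sum(w)

round(rbind(prop, opt))  # optimal favors the variable, cheap-to-sample stratum
```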

Example: Sturgeon sampling

Open sturgeonsampling.R

Stratification for estimating \(\widehat{p}\)

  • A population proportion can be thought of as a population mean where the variable of interest takes only the value zero or one

  • Stratification can be used to estimate a proportion, just as it can be used to estimate a mean

  • The formula for the stratified estimate of a population proportion is given by \[\widehat{p}_s = \frac{1}{N}\sum_{h = 1} ^ K N_h \widehat{p}_h\]

  • The estimated variance of \(\widehat{p}_s\) is given by \[\widehat{var}\left(\widehat{p}_s\right) = \frac{1}{N ^ 2}\sum_{h = 1} ^ K N_h \left(N_h - n_h\right) \widehat{p}_h \left(1 - \widehat{p}_h\right) / \left(n_h - 1\right)\]

Systematic sampling

Systematic sampling

  • Another popular sampling design is systematic sampling.

  • The idea is to randomly choose a unit from the first \(k\) elements of the frame and then sample every \(k\)-th unit thereafter.

    • This is called a one-in-\(k\) systematic sample.
  • Systematic sampling has two advantages over SRS:

    • It is easier to draw!

    • It distributes the sample more evenly over the listed population.

  • Systematic sampling has built-in stratification: units \(1\) to \(k\), \(k + 1\) to \(2k\), etc., in effect form strata, with one sampling unit drawn from each (though not at random!).

  • Systematic sampling often gives substantially more accurate estimates than SRS.

  • Systematic sampling has one disadvantage and one potential disadvantage.
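Drawing a one-in-\(k\) systematic sample takes only a couple of lines in base R (the frame size and interval below are made up):

```r
# One-in-k systematic sample from a frame of N = 100 units
N <- 100
k <- 10                       # sampling interval; yields n = N / k = 10 units
start <- sample(k, size = 1)  # random start among the first k units
seq(from = start, to = N, by = k)
```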

Disadvantage

  • The same formulas used in SRS for estimating the population mean and the population total also apply to systematic sampling.

  • In systematic random sampling, however, there is no reliable way to estimate the standard error of the sample mean (or sample total)!

  • The formulas for stratified random sampling cannot be used (since only one observation is selected).

    • However, if the units in the population are arranged in a random order, then the variance of the sample mean from a systematic sample is, on average, the same as the variance from a simple random sample.

Potential disadvantage

  • If the population contains a periodic type of variation and if the interval between successive units in the sample happens to equal the wavelength (or a multiple of it), the sample may be badly biased!

    • Imagine using a systematic sample to monitor river water quality by taking samples every seventh day (a 1-in-7 systematic sample).

    • This sampling plan reduces to taking a sample of water on the same day of the week for a number of weeks.

    • If an upstream plant discharges waste on a particular day of the week, then the systematic sample is likely to produce poor estimates.

Two-stage sampling

Cluster sampling

In some situations, the population consists of groups of units that are close in some sense—these groups are called clusters.

  • These groups or clusters are known as primary units.

  • The idea of cluster sampling is to obtain an SRS of primary units and then to sample every unit within the sampled clusters.

  • For example, suppose a survey of schools in the state is to be conducted to study the prevalence of lead paint.

    • One could obtain an SRS of schools throughout the state, but this could become complicated and costly.

    • Instead, one could treat school districts as clusters and obtain an SRS of school districts.

    • Once an investigator is in a particular school district, he/she could sample every school in the district.

  • The number of elements in a cluster should be small relative to \(N\) and the number of clusters should be large.

Cluster sampling

  • \(N\): The number of clusters.

  • \(n\): The number of clusters selected in an SRS.

  • \(M_i\): The number of elements in cluster \(i\).

  • \(M = \sum_{i = 1} ^ N M_i\): The total number of elements in the population.

  • \(y_i\): The total of all observations in the \(i\)-th cluster.



There are two approaches to estimating \(\mu\) and \(\tau\) in cluster sampling!

The straightforward approach

  • Just assume the \(y_i\) are an SRS from the primary units: \[\widehat{\tau} = N \bar{y}\] and \[\widehat{var}\left(\widehat{\tau}\right) = \widehat{var}\left(N \bar{y}\right) = N ^ 2 \frac{s ^ 2}{n}\left(1 - \frac{n}{N}\right),\] where \[s ^ 2 = \frac{1}{n - 1}\sum_{i = 1} ^ n \left(y_i - \bar{y}\right) ^ 2.\]

  • However, we can do better by taking the size of the individual clusters into account!

Using a ratio estimator

  • Let \(\widehat{\tau}_r = r M\), where \(r = \frac{\sum_{i = 1} ^ n y_i}{\sum_{i = 1} ^ n M_i}\).

  • The estimated variance of \(\widehat{\tau}_r\) is given by \[\widehat{var}\left(\widehat{\tau}_r\right) = N ^ 2 \frac{s_r ^ 2}{n}\left(1 - \frac{n}{N}\right),\] where \[s_r ^ 2 = \frac{1}{n - 1}\sum_{i = 1} ^ n \left(y_i - r M_i\right) ^ 2.\]
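A sketch of the ratio estimator with made-up cluster data; the variance computation below uses a common form that scales by \(N ^ 2\), mirroring \(var\left(N\bar{y}\right) = N ^ 2 var\left(\bar{y}\right)\) from the SRS section:

```r
# Hypothetical cluster sample: n = 4 clusters drawn from N = 40 clusters;
# y_i is the total for cluster i, M_i its size, and M the population size
N <- 40; M <- 1200
M_i <- c(25, 40, 30, 35)
y_i <- c(50, 88, 57, 77)
n <- length(y_i)

r <- sum(y_i) / sum(M_i)  # ratio estimate of the per-element mean
tau_r <- r * M            # estimated population total

# Residual-based variance; the N^2 factor puts it on the scale of a total
s2_r <- sum((y_i - r * M_i)^2) / (n - 1)
var_tau_r <- N^2 * (1 - n / N) * s2_r / n
sqrt(var_tau_r)           # estimated standard error of tau_r
```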

Cluster sampling

  • Cluster sampling is a special case of two-stage sampling

  • In the second stage of cluster sampling, all the secondary units in selected clusters are sampled.

  • With cluster sampling, in contrast to stratified random sampling, every unit within a selected cluster is sampled!

  • Therefore, to optimize cluster sampling, one would want to choose clusters that are as heterogeneous as possible (otherwise you would be collecting redundant information!)

  • If all the units within a cluster are very similar to each other, then sampling every unit within the cluster is not very efficient.

  • Thus, the goals of choosing strata in stratified random sampling are different from the goals of choosing clusters in cluster sampling

The data quality objectives (DQO) process

The U.S. Environmental Protection Agency (EPA) developed the DQO process to ensure a successful data collection process. Details can be found on the web at https://www.epa.gov/quality

The steps of the DQO process can be summarized as follows:

  • State the problem: describe the problem, review prior work, and understand important factors

  • Identify the decision: what questions need to be answered?

  • Identify the inputs to the decision: determine what data is needed to answer questions

  • Define the boundaries of the study: time periods and spatial areas to which the decisions will apply. Determine when and where data is to be gathered

  • Develop a decision rule: define the parameter(s) of interest, specify action limits

  • Specify tolerable limits on decision errors: this often involves issues of type I and type II error probabilities in hypothesis testing

  • Optimize the design for obtaining data: consider a variety of designs and attempt to determine which design will be the most resource-efficient